This module will cover exploratory data analysis and largely follows
the Exploratory Data Analysis chapter in R for Data
Science. Exploratory data analysis is a systematic iterative process
of data visualization and transformation to get a better understanding
of your data. During this process, you will:
For this module, we’ll be using the diamonds data set
from the dplyr package in tidyverse. We’ll
rely heavily on the dplyr package for data transformation
and ggplot2 for data visualization. Let’s load all relevant
packages and take a look at the first 100 rows of that data set before
we get started!
require(tidyverse)
require(DT) # To print data frames
diamonds %>%
head(n = 100) %>%
datatable(options = list(pageLength = 100, scrollX = 250, scrollY = 300, dom = "tpi"))
And let’s look at the structure of the data set as well…
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
And we can also consult the help pages (?diamonds) to
get additional information of the data set.
There are two basic questions that underlie almost all exploratory data analyses:
Variation refers to the tendency of the values of a variable to change from measurement to measurement.
Covariation refers to the tendency for values of two or more values to vary together in a related way.
We can think of variation as reflecting uncertainty and covariation as a means to reduce that uncertainty.
The method of visualizing a distribution depends on whether the variable you’re visualizing is discrete or continuous.
For discrete variables, a bar chart is most relevant, so
geom_bar is the most commonly used
ggplot(data = diamonds) +
geom_bar(aes(x = cut))
We could also compute the count values depicted in the figure above
manually using the count function.
diamonds %>%
count(cut)
## # A tibble: 5 × 2
## cut n
## <ord> <int>
## 1 Fair 1610
## 2 Good 4906
## 3 Very Good 12082
## 4 Premium 13791
## 5 Ideal 21551
Count is a helpful function to quickly discern the number of unique
values of one or more variables. Note that if we wanted to, we could
supply multiple variables to the count function and get the
unique values across both variables, like so:
diamonds %>%
count(cut, color)
## # A tibble: 35 × 3
## cut color n
## <ord> <ord> <int>
## 1 Fair D 163
## 2 Fair E 224
## 3 Fair F 312
## 4 Fair G 314
## 5 Fair H 303
## 6 Fair I 175
## 7 Fair J 119
## 8 Good D 662
## 9 Good E 933
## 10 Good F 909
## # … with 25 more rows
We have more options in visualizing a continuous variable than a
discrete one. Probably the most typical method is a histogram, which can
be generated using the geom_histogram function in
ggplot2
ggplot(data = diamonds) +
geom_histogram(aes(x = carat))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Note the warning message about binwidth. This message is telling us
that by default, ggplot creates 30 bins (discrete equal interval regions
in a variable’s distribution) within which to calculate the frequency
count of values. We can change this default setting using the
binwidth argument in geom_histogram. The
binwidth argument requires us to specify how wide the bins
should be in the raw score metric of the variable (in this case, in
carats). To determine an appropriate bin width it helps to know the
range of the distribution.
diamonds %>%
summarize(min = min(carat, na.rm = TRUE),
max = max(carat, na.rm = TRUE))
## # A tibble: 1 × 2
## min max
## <dbl> <dbl>
## 1 0.2 5.01
The range is roughly between 0 and 5. Let’s change the binwidth to .25. This will yield 5 / .25 = 20 bins.
ggplot(data = diamonds) +
geom_histogram(aes(x = carat), binwidth = .25)
Notice how the bins are a little wider than the default of 30 bins above. I actually prefer the default 30 bins in this case. Let’s try a thinner binwidth.
ggplot(data = diamonds) +
geom_histogram(aes(x = carat), binwidth = .01)
This plot has 5 / .01 = 500 bins, and I think it’s an improvement
over our previous two plots. Note that the binwidth
argument was specified outside of the aes function because
we’re not mapping bindwidth to a variable in our diamonds
data frame.
In addition to a histogram, we could also visualize a continuous distribution using frequency polygons or a kernel density plot.
Below is a frequency polygon. It’s largely the same as a histogram except instead of drawing frequency counts in bins using bars, it draws frequency counts in bins using lines.
ggplot(data = diamonds) +
geom_freqpoly(aes(x = carat))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can specify a narrower bindwidth in geom_freqpoly,
just as with geom_histogram as well.
ggplot(data = diamonds) +
geom_freqpoly(aes(x = carat), binwidth = .01)
And note that we can also map additional aesthetics to other
variables in geom_freqpoly, too.
ggplot(data = diamonds) +
geom_freqpoly(aes(x = carat, color = cut), binwidth = .1)
We can also visualize a continuous distribution using a kernel
density plot via geom_density.
ggplot(data = diamonds) +
geom_density(aes(x = carat))
We can also visualize this kernel density plot as a function of a second variable using aesthetic specifications. Here, we’ll use fill because we want to fill all the area under the curve.
ggplot(data = diamonds) +
geom_density(aes(x = carat, fill = cut))
That’s kind of neat looking, but hard to interpret because the ideal
cut covers over the others. Let’s add some transparency to this plot
using the alpha argument to improve visibility. We’ll also
map color to cut so that the curve perimeter lines are the same color as
the area under the curves. Note again the placement of alpha outside of
aes because it’s not mapped to a variable in the
diamonds data frame.
ggplot(data = diamonds) +
geom_density(aes(x = carat, fill = cut, color = cut), alpha = .4)
There are several other geoms that can be used to visualize
univariate continuous distributions, including geom_area,
geom_dotplot, and geom_qq. See the ggplot2
cheat sheet and help documentation for more details.
A couple things to note about the distribution of this variable:
Whole integer and half-integer values appear to be more common than other values
There are more values to the right of the whole integer values than to the left
Small carat diamonds (those =< 1 carat) are common, and large carat diamonds (those greater than say 2 carats) are exceedingly rare
Another important aspect of univariate data screening is identifying outliers. Outliers may be extreme but valid values in the variable’s distribution or they may be data entry errors. Our approach to dealing with outliers will differ depending on the source of the outlying value, so it’s important to bring to bear our prior knowledge of the variable to our consideration of outliers.
Let’s take a look at the y variable in the diamonds data
set. According to the help documentation, the y variable reflects the
width (in mm) of diamonds. Let’s visualize the distribution of this
variable using a histogram
ggplot(data = diamonds) +
geom_histogram(aes(x = y), binwidth = 0.5)
It’s hard to see if there are any outliers in this distribution
because there are so many obsevations in the bins. We’ll need to zoom in
on specific regions of the distribution to better visualize outlying
values. To zoom in on a plot, we need to use the
coord_cartesion function. This function includes
xlim and ylim arguments, which help us specify
which regions of the plot to zoom in on. In the plot below, we’ll limit
the y-axis to be between 0 and 50, which will allow us to visualize
uncommon observations in the distribution.
ggplot(data = diamonds) +
geom_histogram(aes(x = y), binwidth = 0.5) +
coord_cartesian(ylim = c(0, 50))
Now we see that there appear to be three uncommon values (low frequency occurrence) values in the distribution: 0, ~30, and ~60.
Let’s isolate those values in our data set and take a look at them
diamonds %>%
filter(y == 0 | y > 20) %>%
arrange(y)
## # A tibble: 9 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1 Very Good H VS2 63.3 53 5139 0 0 0
## 2 1.14 Fair G VS1 57.5 67 6381 0 0 0
## 3 1.56 Ideal G VS2 62.2 54 12800 0 0 0
## 4 1.2 Premium D VVS1 62.1 59 15686 0 0 0
## 5 2.25 Premium H SI2 62.8 59 18034 0 0 0
## 6 0.71 Good F SI2 64.1 60 2130 0 0 0
## 7 0.71 Good F SI2 64.1 60 2130 0 0 0
## 8 0.51 Ideal E VS1 61.8 55 2075 5.15 31.8 5.12
## 9 2 Premium H SI2 58.9 57 12210 8.09 58.9 8.06
Based on our background knowledge, we know that diamonds cannot have a width of 0 mm. In addition, diamonds with a width of 31.8 mm or 58.9 mm would be exceedingly large (> 1 inch) and should be very expensive, but these two diamonds are relatively modestly priced. It looks like all of these values are likely to be data entry errors and so we should be quite suspicious of them.
Once we’ve detected outliers and reached a conclusion as their source, we need to deal with these values so they don’t muddle our subsequent analyses.
There are two approaches for dealing with outliers. One approach is to filter out the whole case associated with those suspicious outlying values. We could use the following code to do so:
diamonds2 <- diamonds %>%
filter(y != 0 & y != 31.8 & y != 58.9)
Note that we could perform the same filter using the following code as well
diamonds2 <- diamonds %>%
filter(! y %in% c(0, 31.8, 58.9))
diamonds2 <- diamonds %>%
filter(between(y, 3, 20))
In general, I prefer not to pursue this approach because we’re removing all data associated with that case. Instead, it’s often better to remove the specific outlying values associated with a case by setting them to NA. In so doing, we retain all of the other information from that case, which would do not have reason to believe originated from a data entry error.
diamonds2 <- diamonds %>%
mutate(y = case_when(
y == 0 | y == 31.8 | y == 58.9 ~ NA_real_,
TRUE ~ y
))
We used the case_when function here to replace these
values with NAs. Note that we had to specify the data type of the
resulting variable when we converted to NA. This is because by default
NA values are of type logical; however, the other values in this
variable are double. Recall, that all values within a column in a data
frame must be of the same type (although values in different columns can
be of different type). So, we have to specify what type we want the NA
value to be so that it doesn’t throw up an error. For double (numeric)
values, this is “NA_real_”. Then, to keep all other values in the y
variable the same, we use “TRUE ~ y”. This is saying that for all values
that don’t satisfy the first logical proposition (i.e., the y == 0 | y
== 31.8 | y == 58.9), keep them as is.
Note that one helpful feature of ggplot2 is that it will
throw up a warning message when you have missing values. It won’t plot
those values because it doesn’t have a place to put them, though.
ggplot(data = diamonds2, aes(x = x, y = y)) +
geom_point()
## Warning: Removed 9 rows containing missing values (`geom_point()`).
The best way to visualize covariation depends on the type of variables that we’re evaluating. So, we’ll divide this section by type of variables.
The best way to visualize covariation between a continuous and discrete variable is to depict the continuous variable at each level of the discrete variable. For this example, we’ll look at variability of price at different levels of cut for diamonds in the diamonds data set.
ggplot(data = diamonds) +
geom_freqpoly(aes(x = price, color = cut))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This frequency polygon visualization is not ideal because the sizes of cut groups differed greatly and the frequency polygon presented simple counts.
diamonds %>%
count(cut)
## # A tibble: 5 × 2
## cut n
## <ord> <int>
## 1 Fair 1610
## 2 Good 4906
## 3 Very Good 12082
## 4 Premium 13791
## 5 Ideal 21551
Instead, what we’d like to do is normalize the count for each cut level so that they’re all on the same scale. We can do so by plotting density instead of count on the y-axis (in this case, using a density plot renders the area under the density curve to 1 for each cut level, so all levels of cut will be on the same scale).
ggplot(data = diamonds) +
geom_freqpoly(aes(x = price, y = ..density.., color = cut))
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This plot reveals something surprising…fair diamonds (the lowest quality) appear to be the most expensive! We’ll come back to this surprising result later.
Another way to visualize a continuous variable across levels of a categorical variable is using a boxplot. Below is a depiction of the summarized quantities in a boxplot (from R for Data Science):
Let’s take a look at a boxplot of price by cut using
geom_boxplot
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_boxplot()
We lose a little bit of information about these distributions using boxplots, but note that we can still tell that fair diamonds tend to be more expensive than higher quality diamonds (especially ideal diamonds), and that distributions of price are positively skewed, especially for higher quality diamonds.
Note that we could easily flip the x- and y-axis of this plot using
coord_flip. This is something I often like to do when I
have categorical labels on the x-axis, especially when those labels are
long and consequently work better when presented horizontally on the
y-axis.
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_boxplot() +
coord_flip()
In this instance, the coordinate flipping doesn’t meaningfully improve the plot.
To visualize covariation between two categorical variables, we need
to count the number of instances for each combination of the two
variables. One way to do this is to use geom_count, which
will present the number of observations at each combination of values by
the size of the resulting circles.
ggplot(data = diamonds) +
geom_count(mapping = aes(x = cut, y = color))
We can also compute the count with dplyr
diamonds %>%
count(color, cut)
## # A tibble: 35 × 3
## color cut n
## <ord> <ord> <int>
## 1 D Fair 163
## 2 D Good 662
## 3 D Very Good 1513
## 4 D Premium 1603
## 5 D Ideal 2834
## 6 E Fair 224
## 7 E Good 933
## 8 E Very Good 2400
## 9 E Premium 2337
## 10 E Ideal 3903
## # … with 25 more rows
And then we can use those counts in a visualization using
geom_tile
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n))
Note that we’ve now scaled color to frequency count.
Probably the most common way to visualize continuous x continuous covariation is using a scatterplot, like so
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price))
Here we can see an exponential relationship between carat and price (price increases exponentially as a function of carat). The plot is somewhat difficulty to discern because there are so many data points. We can adjust the transparency of the points to help visualize these data points better.
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price), alpha = .01)
Another approach to visualizing large data sets is to use binning. We
used binning previously with histograms and frequency polygons, but now
we’ll apply it in two dimensions using geom_bind2d
ggplot(data = diamonds) +
geom_bin2d(mapping = aes(x = carat, y = price))
# install.packages("hexbin")
ggplot(data = diamonds) +
geom_hex(mapping = aes(x = carat, y = price))
geom_bin2d and geom_hex both divide the
coordinate plane into 2D bins and then scale the frequency count to
color. geom_bind2d creates rectangular bins and
geom_hex creates hexagonal bins. Note that you can change
the number of bins using the bins argument in
geom_bin2d and geom_hex
ggplot(data = diamonds) +
geom_bin2d(mapping = aes(x = carat, y = price), bins = 100)
ggplot(data = diamonds) +
geom_hex(mapping = aes(x = carat, y = price), bins = 100)
An alternative way to visualize large data is to bin one of the continuous variables so it acts like a categorical variable. Then you can use one of the techniques for visualizing continuous x categorical covariation, such as a boxplot.
To do so, we’ll use the cut_width function, which
divides up a continuous variable into bins of a specified width
table(cut_width(runif(1000), width = 0.1, boundary = 0)) %>%
knitr::kable("markdown")
| Var1 | Freq |
|---|---|
| [0,0.1] | 97 |
| (0.1,0.2] | 98 |
| (0.2,0.3] | 107 |
| (0.3,0.4] | 99 |
| (0.4,0.5] | 116 |
| (0.5,0.6] | 108 |
| (0.6,0.7] | 88 |
| (0.7,0.8] | 99 |
| (0.8,0.9] | 85 |
| (0.9,1] | 103 |
ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
We could also use cut_number, which will create bins
with equal numbers of cases within each bin
ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_number(carat, n = 20)))
We won’t be doing much modeling in this course, but I wanted to introduce it here because using simple models can be an important part of exploratory data analysis. Earlier, we encountered the paradoxical finding that lower quality diamonds were associated with higher price. We’ve also seen that larger diamonds (higher carat) are also more expensive. In addition, as can be seen below, lower quality diamonds tend to be larger than higher quality diamonds.
ggplot(diamonds) +
geom_boxplot(aes(x = cut, y = carat))
It’s therefore possible that lower quality diamonds are more expensive because they tend to be larger than higher quality diamonds. We may be able to account for this paradoxical finding, then, by adjusting for diamond size before evaluating the association of diamond quality and price.
To adjust for diamond size, we’ll retain the residuals from a model predicting diamond price from diamond size (carat). We’ll place price and carat on log scales to properly account for the exponential association observed between them.
ggplot(data = diamonds2) +
geom_point(mapping = aes(x = carat, y = price), alpha = .05)
require(modelr)
## Loading required package: modelr
mod <- lm(log(price) ~ log(carat), data = diamonds)
diamonds2 <- diamonds %>%
add_residuals(mod) %>%
mutate(resid = exp(resid))
ggplot(data = diamonds2) +
geom_point(mapping = aes(x = carat, y = resid), alpha = .05)
After we’ve created this residualized price variable, we can then explore the association of residualized price and diamond quality
ggplot(data = diamonds2) +
geom_boxplot(mapping = aes(x = cut, y = resid))
And now we see the expected relationship between diamond quality and diamond price!